J(θ) θ
$$J(\theta) = \mathbb{E}_{(x,y)\sim\hat{p}}\, L(f(x;\theta), y),$$
L f(x; θ)
x y ˆp
p
$$J^*(\theta) = \mathbb{E}_{(x,y)\sim p}\, L(f(x;\theta), y).$$
P
J(θ)
P J
x y
p(x, y) L(x, y)
$\mathbb{E}_{x,y\sim p(x,y)}[L(x,y)]$
p
p(x, y)
p(x, y)
p(x, y) ˆp(x, y)
$$\mathbb{E}_{x,y\sim\hat{p}(x,y)}[L(f(x;\theta), y)] = \frac{1}{m}\sum_{i=1}^{m} L(f(x^{(i)};\theta), y^{(i)})$$
m
$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p(x^{(i)};\theta).$$
$$J(\theta) = \mathbb{E}_{x\sim\hat{p}} \log p(x;\theta).$$
J
θ
$$\nabla_\theta J(\theta) = \mathbb{E}_{x\sim\hat{p}} \nabla_\theta \log p(x;\theta).$$
n
$\hat{\sigma}/\sqrt{n}$, $\hat{\sigma}$
m
1
m
g
H H
1
g
H
g
H
1
g H
J(x)
x
(x, y)
p(x, y)
y
L(f(x; θ), y) p(x, y)
θ
f(·; θ) θ
$$J^*(\theta) = \int L(f(x;\theta), y)\, dp(x,y)$$
p
$$g = \frac{\partial J^*(\theta)}{\partial\theta} = \int \frac{\partial L(f(x;\theta), y)}{\partial\theta}\, dp(x,y),$$
(x, y)
p
$$\hat{g} = \frac{\partial L(f(x;\theta), y)}{\partial\theta}.$$
ˆ
g
g θ
ˆ
g
$\theta' = \theta - \alpha g$, $\alpha$
$$g = \nabla_\theta J(\theta).$$
$$J(\theta') \approx J(\theta) - \alpha g^\top g + \frac{1}{2}\alpha^2 g^\top H g$$
H J θ αg
g
$\frac{1}{2}\alpha^2 g^\top H g$
H
g g
H
g
g g
Hg
gHg
i
j
m n n!
m
α 1
(m ×n)
n n
n
θ
θ
θ
$$f = f_T \circ f_{T-1} \circ \dots \circ f_2 \circ f_1$$
f(x) x
$$f' = f'_T\, f'_{T-1} \cdots f'_2\, f'_1$$
where
$$f' = \frac{\partial f(x)}{\partial x}, \qquad f'_t = \frac{\partial f_t(a_t)}{\partial a_t},$$
and
$$a_t = f_{t-1}(f_{t-2}(\dots f_2(f_1(x)))).$$
y x
α α
T
α < 1 α > 1 T
T
T
T e
T
x log x T
W
$x_1, \dots, x_t, \dots$
$$s_t = F_\theta(s_{t-1}, x_t)$$
s
t
F
θ
$$o_t = g_\omega(s_t),$$
L
t
t o
t
y
t
L
T
T
θ F
θ
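$$\frac{\partial L_T}{\partial\theta} = \sum_{t\le T} \frac{\partial L_T}{\partial s_t}\, \frac{\partial s_t}{\partial\theta}$$
$$\frac{\partial L_T}{\partial\theta} = \sum_{t\le T} \frac{\partial L_T}{\partial s_T}\, \frac{\partial s_T}{\partial s_t}\, \frac{\partial F_\theta(s_{t-1}, x_t)}{\partial\theta}$$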
θ
F
θ
s
t
= F
θ
(s
t1
, x
t
)
θ s
t1
θ s
t
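$\frac{\partial L_T}{\partial s_T}\, \frac{\partial s_T}{\partial s_t}$
$$\frac{\partial s_T}{\partial s_t} = \frac{\partial s_T}{\partial s_{T-1}}\, \frac{\partial s_{T-1}}{\partial s_{T-2}} \cdots \frac{\partial s_{t+1}}{\partial s_t}$$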
L
T
θ
T t
t T
t 1 t
s
t
s
t1
θ
$[x^{(t)}, y^{(t)}]$
$$\theta \leftarrow \theta - \epsilon \nabla_\theta \sum_t L(f(x^{(t)};\theta), y^{(t)};\theta),$$
1/L L
µ L
(1
µ
L
)
µ
O(1/k) k
L
µ
µ
m
$\mathbb{E}[\hat{g}]$
$$\mathbb{E}[\hat{g}] = g,$$
g
m = 1
m > 1 m
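Require: learning rate $\eta_k$ at iteration $k$
Require: initial parameter $\theta$
while stopping criterion not met do
    Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$
    $\hat{g} = 0$
    for $i = 1$ to $m$ do
        $\hat{g} \leftarrow \hat{g} + \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})/m$
    end for
    $\theta \leftarrow \theta - \eta_k \hat{g}$
end while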
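The minibatch update can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-parameter model $f(x;\theta) = \theta x$, the squared-error loss, the toy data, and the step-size schedule $\eta_k = \eta_0/(1 + \text{decay}\cdot k)$ are all hypothetical choices, not taken from the text.

```python
import random

def sgd(data, theta=0.0, eta0=0.1, decay=0.01, m=2, steps=200, seed=0):
    """Minibatch SGD for the assumed model f(x; theta) = theta * x
    with loss L = 0.5 * (f(x; theta) - y)**2."""
    rng = random.Random(seed)
    for k in range(steps):
        eta_k = eta0 / (1.0 + decay * k)     # assumed decaying step size
        batch = rng.sample(data, m)          # sample a minibatch of m examples
        g_hat = 0.0
        for x, y in batch:
            # dL/dtheta = (theta * x - y) * x, averaged over the minibatch
            g_hat += (theta * x - y) * x / m
        theta -= eta_k * g_hat               # theta <- theta - eta_k * g_hat
    return theta

# Noise-free toy data generated by y = 2x, so the minimizer is theta = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
theta_sgd = sgd(data)
```

The assumed schedule decays like $1/k$, so its sum diverges while the sum of its squares stays finite.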
m
$$\sum_{k=1}^{\infty} \eta_k = \infty, \quad \text{and} \quad \sum_{k=1}^{\infty} \eta_k^2 < \infty.$$
k µ
L
$O((1 - \frac{\mu}{L})^k)$
$\mu$
$O(1/k)$, $O(1/k^2)$
$O(1/\sqrt{k})$, $O(1/k)$
O(1/k)
O()
v
$$v \leftarrow \alpha v - \eta \nabla_\theta \left( \frac{1}{m}\sum_{t=1}^{m} L(f(x^{(t)};\theta), y^{(t)}) \right)$$
$$\theta \leftarrow \theta + v$$
v
$\nabla_\theta \left( \frac{1}{n}\sum_{t=1}^{n} L(f(x^{(t)};\theta), y^{(t)}) \right)$
α η
α η
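Require: learning rate $\eta$, momentum parameter $\alpha$
Require: initial parameter $\theta$, initial velocity $v$
while stopping criterion not met do
    Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$
    $g = 0$
    for $i = 1$ to $m$ do
        $g \leftarrow g + \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})$
    end for
    $v \leftarrow \alpha v - \eta g$
    $\theta \leftarrow \theta + v$
end while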
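A minimal Python sketch of the velocity update $v \leftarrow \alpha v - \eta g$, $\theta \leftarrow \theta + v$. The objective $J(\theta) = \frac{1}{2}(\theta_1^2 + 10\,\theta_2^2)$, the full-batch gradient, and the hyperparameter values are assumptions chosen only to keep the example deterministic.

```python
def grad(theta):
    # Gradient of the assumed toy objective J = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return [theta[0], 10.0 * theta[1]]

def momentum(theta, eta=0.05, alpha=0.9, steps=300):
    v = [0.0] * len(theta)                                    # initial velocity
    for _ in range(steps):
        g = grad(theta)
        v = [alpha * vi - eta * gi for vi, gi in zip(v, g)]   # v <- alpha*v - eta*g
        theta = [ti + vi for ti, vi in zip(theta, v)]         # theta <- theta + v
    return theta

theta_momentum = momentum([1.0, 1.0])
```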
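$$v \leftarrow \alpha v - \eta \nabla_\theta \left[ \frac{1}{m}\sum_{t=1}^{m} L\left(f(x^{(t)};\theta + \alpha v), y^{(t)}\right) \right],$$
$$\theta \leftarrow \theta + v,$$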
α η
$O(1/k)$, $k$, $O(1/k^2)$
O(1
µ
L
) O(1
µ
L
)
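[Figure: standard momentum compared with Nesterov momentum. Arrow labels: the gradient step $\eta\nabla_\theta \frac{1}{m}\sum_{i=1}^{m} L(f(x^{(i)};\theta), y^{(i)})$, the look-ahead gradient step $\eta\nabla_\theta \frac{1}{m}\sum_{i=1}^{m} L(f(x^{(i)};\theta + \alpha v), y^{(i)})$, and the velocity term $\alpha v$. Legend: Standard momentum; Nesterov correction term; Nesterov accumulated gradient.]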
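Require: learning rate $\eta$, momentum parameter $\alpha$
Require: initial parameter $\theta$, initial velocity $v$
while stopping criterion not met do
    Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$
    Apply interim update: $\tilde{\theta} \leftarrow \theta + \alpha v$
    $g = 0$
    for $i = 1$ to $m$ do
        $g \leftarrow g + \nabla_{\tilde{\theta}} L(f(x^{(i)};\tilde{\theta}), y^{(i)})$
    end for
    $v \leftarrow \alpha v - \eta g$
    $\theta \leftarrow \theta + v$
end while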
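The only change from standard momentum is where the gradient is evaluated. The sketch below assumes the same hypothetical quadratic objective $J(\theta) = \frac{1}{2}(\theta_1^2 + 10\,\theta_2^2)$ and full-batch gradients; all names and values are illustrative.

```python
def grad(theta):
    # Gradient of the assumed toy objective J = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return [theta[0], 10.0 * theta[1]]

def nesterov(theta, eta=0.05, alpha=0.9, steps=300):
    v = [0.0] * len(theta)
    for _ in range(steps):
        # Interim (look-ahead) point: theta_tilde = theta + alpha * v
        theta_tilde = [ti + alpha * vi for ti, vi in zip(theta, v)]
        g = grad(theta_tilde)                                 # gradient at the look-ahead point
        v = [alpha * vi - eta * gi for vi, gi in zip(v, g)]   # v <- alpha*v - eta*g
        theta = [ti + vi for ti, vi in zip(theta, v)]         # theta <- theta + v
    return theta

theta_nesterov = nesterov([1.0, 1.0])
```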
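Require: global learning rate $\eta$
Require: initial parameter $\theta$
Initialize gradient accumulation variable $r = 0$
while stopping criterion not met do
    Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$
    $g = 0$
    for $i = 1$ to $m$ do
        $g \leftarrow g + \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})$
    end for
    $r \leftarrow r + g^2$ (square applied element-wise)
    $\Delta\theta \leftarrow -\frac{\eta}{\sqrt{r}}\, g$ ($\frac{1}{\sqrt{r}}$ applied element-wise)
    $\theta \leftarrow \theta + \Delta\theta$
end while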
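A short Python sketch of the accumulate-and-rescale step. The toy objective $J(\theta) = \frac{1}{2}(\theta_1^2 + 10\,\theta_2^2)$ and the constant `delta` are assumptions: `delta` is not part of the update as written above and is added here only to avoid dividing by zero when a gradient component is exactly zero.

```python
import math

def grad(theta):
    # Gradient of the assumed toy objective J = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return [theta[0], 10.0 * theta[1]]

def adagrad(theta, eta=0.5, delta=1e-8, steps=100):
    r = [0.0] * len(theta)                                # accumulated squared gradients
    for _ in range(steps):
        g = grad(theta)
        r = [ri + gi * gi for ri, gi in zip(r, g)]        # r <- r + g^2 (element-wise)
        # delta_theta = -(eta / sqrt(r)) * g, with the assumed stabilizer delta
        theta = [ti - eta * gi / (math.sqrt(ri) + delta)
                 for ti, gi, ri in zip(theta, g, r)]
    return theta

theta_adagrad = adagrad([1.0, 1.0])
```

On this objective both coordinates follow the same trajectory even though their curvatures differ by a factor of ten, because each step is divided by that coordinate's own gradient history.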
t
ρ
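Require: global learning rate $\eta$, decay rate $\rho$
Require: initial parameter $\theta$
Initialize accumulation variable $r = 0$
while stopping criterion not met do
    Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$
    $g = 0$
    for $i = 1$ to $m$ do
        $g \leftarrow g + \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})$
    end for
    $r \leftarrow \rho r + (1 - \rho) g^2$
    $\Delta\theta = -\frac{\eta}{\sqrt{r}}\, g$ ($\frac{1}{\sqrt{r}}$ applied element-wise)
    $\theta \leftarrow \theta + \Delta\theta$
end while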
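The difference from AdaGrad is the exponentially decaying average of squared gradients. The Python sketch below assumes the toy objective $J(\theta) = \frac{1}{2}(\theta_1^2 + 10\,\theta_2^2)$, full-batch gradients, and a small stabilizer `delta` that is not part of the update as written above.

```python
import math

def grad(theta):
    # Gradient of the assumed toy objective J = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return [theta[0], 10.0 * theta[1]]

def rmsprop(theta, eta=0.01, rho=0.9, delta=1e-8, steps=500):
    r = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        # r <- rho * r + (1 - rho) * g^2 (element-wise)
        r = [rho * ri + (1.0 - rho) * gi * gi for ri, gi in zip(r, g)]
        # delta_theta = -(eta / sqrt(r)) * g, with the assumed stabilizer delta
        theta = [ti - eta * gi / (math.sqrt(ri) + delta)
                 for ti, gi, ri in zip(theta, g, r)]
    return theta

theta_rmsprop = rmsprop([1.0, 1.0])
```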
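Require: global learning rate $\eta$, decay rate $\rho$, momentum coefficient $\alpha$
Require: initial parameter $\theta$, initial velocity $v$
Initialize accumulation variable $r = 0$
while stopping criterion not met do
    Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$
    Apply interim update: $\tilde{\theta} \leftarrow \theta + \alpha v$
    $g = 0$
    for $i = 1$ to $m$ do
        $g \leftarrow g + \nabla_{\tilde{\theta}} L(f(x^{(i)};\tilde{\theta}), y^{(i)})$
    end for
    $r \leftarrow \rho r + (1 - \rho) g^2$
    $v \leftarrow \alpha v - \frac{\eta}{\sqrt{r}}\, g$ ($\frac{1}{\sqrt{r}}$ applied element-wise)
    $\theta \leftarrow \theta + v$
end while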
θ
j
{x
(i)
, y
(i)
}
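$$\Delta\theta_j = -\frac{1}{\frac{\partial^2}{\partial\theta_j^2} L(f(x^{(i)};\theta_0), y^{(i)})}\; \frac{\partial}{\partial\theta_j} L(f(x^{(i)};\theta_0), y^{(i)})$$
$$\frac{1}{\frac{\partial^2}{\partial\theta_j^2} L(f(x^{(i)};\theta_0), y^{(i)})} = -\frac{\Delta\theta_j}{\frac{\partial}{\partial\theta_j} L(f(x^{(i)};\theta_0), y^{(i)})}$$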
θ
j
θ
j
θ
0
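$$L(f(x^{(i)};\theta_0 + e_j \Delta\theta_j), y^{(i)}) \approx L(f(x^{(i)};\theta_0), y^{(i)}) + \frac{\partial}{\partial\theta_j} L(f(x^{(i)};\theta_0), y^{(i)})\, \Delta\theta_j + \frac{1}{2} \frac{\partial^2}{\partial\theta_j^2} L(f(x^{(i)};\theta_0), y^{(i)})\, \Delta\theta_j^2$$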
θ
j
θ
j
$$\Delta\theta_j = -\frac{\frac{\partial}{\partial\theta_j} L(f(x^{(i)};\theta_0), y^{(i)})}{\frac{\partial^2}{\partial\theta_j^2} L(f(x^{(i)};\theta_0), y^{(i)})}$$
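Require: step size $\alpha$
Require: decay rates $\rho_1$ and $\rho_2$
Require: initial parameter $\theta$
Initialize first and second moment variables $s = 0$, $r = 0$
Initialize time step $t = 0$
while stopping criterion not met do
    Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$
    $g = 0$
    for $i = 1$ to $m$ do
        $g \leftarrow g + \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})$
    end for
    $t \leftarrow t + 1$
    $s \leftarrow \rho_1 s + (1 - \rho_1) g$
    $r \leftarrow \rho_2 r + (1 - \rho_2) g^2$
    $\hat{s} \leftarrow \frac{s}{1 - \rho_1^t}$
    $\hat{r} \leftarrow \frac{r}{1 - \rho_2^t}$
    $\Delta\theta = -\alpha \frac{\hat{s}}{\sqrt{\hat{r}} + \epsilon}$ (operations applied element-wise; $\epsilon$ is a small constant for numerical stability)
    $\theta \leftarrow \theta + \Delta\theta$
end while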
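The bias-corrected moment estimates can be sketched in Python as follows. The toy objective $J(\theta) = \frac{1}{2}(\theta_1^2 + 10\,\theta_2^2)$, the full-batch gradient, and the hyperparameter values are assumptions made for illustration; `eps` plays the role of the small stabilizing constant in the denominator.

```python
import math

def grad(theta):
    # Gradient of the assumed toy objective J = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return [theta[0], 10.0 * theta[1]]

def adam(theta, alpha=0.05, rho1=0.9, rho2=0.999, eps=1e-8, steps=1000):
    s = [0.0] * len(theta)   # first-moment estimate
    r = [0.0] * len(theta)   # second-moment estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        s = [rho1 * si + (1.0 - rho1) * gi for si, gi in zip(s, g)]
        r = [rho2 * ri + (1.0 - rho2) * gi * gi for ri, gi in zip(r, g)]
        s_hat = [si / (1.0 - rho1 ** t) for si in s]   # bias correction
        r_hat = [ri / (1.0 - rho2 ** t) for ri in r]
        theta = [ti - alpha * sh / (math.sqrt(rh) + eps)
                 for ti, sh, rh in zip(theta, s_hat, r_hat)]
    return theta

theta_adam = adam([1.0, 1.0])
```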
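Require: decay rate $\rho$, small constant $\epsilon$
Require: initial parameter $\theta$
Initialize accumulation variables $r = 0$, $s = 0$
while stopping criterion not met do
    Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$
    $g = 0$
    for $i = 1$ to $m$ do
        $g \leftarrow g + \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})$
    end for
    $r \leftarrow \rho r + (1 - \rho) g^2$
    $\Delta\theta = -\frac{\sqrt{s + \epsilon}}{\sqrt{r + \epsilon}}\, g$ (operations applied element-wise)
    $s \leftarrow \rho s + (1 - \rho) [\Delta\theta]^2$
    $\theta \leftarrow \theta + \Delta\theta$
end while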
$$J(\theta) = \mathbb{E}_{x,y\sim\hat{p}(x,y)}[L(f(x;\theta), y)] = \frac{1}{m}\sum_{i=1}^{m} L(f(x^{(i)};\theta), y^{(i)}).$$
J(θ) H(J)(θ)
H
$$H(J)(\theta)_{i,j} = \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} J(\theta).$$
θ J
J(θ) θ
0
$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^\top H(J)(\theta_0)(\theta - \theta_0).$$
$$\theta^* = \theta_0 - [H(J(\theta_0))]^{-1} \nabla_\theta J(\theta_0)$$
H
H
1
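Objective: $J(\theta) = \frac{1}{m}\sum_{i=1}^{m} L(f(x^{(i)};\theta), y^{(i)})$
Require: initial parameter $\theta_0$
while stopping criterion not met do
    $g = 0$
    $H = 0$
    for $i = 1$ to $m$ do
        $g \leftarrow g + \frac{1}{m}\nabla_\theta L(f(x^{(i)};\theta), y^{(i)})$
        $H \leftarrow H + \frac{1}{m}\nabla_\theta^2 L(f(x^{(i)};\theta), y^{(i)})$
    end for
    Compute $H^{-1}$
    $\Delta\theta_t = -H^{-1} g$
    $\theta_{t+1} = \theta_t + \Delta\theta_t$
end while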
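A single Newton step can be illustrated on a least-squares problem, where the objective is exactly quadratic and one step reaches the minimizer. The two-parameter linear model $f(x;\theta) = \theta_0 + \theta_1 x$, the squared-error loss, and the toy data are assumptions made for this sketch; the 2×2 Hessian is inverted in closed form.

```python
def newton_step(theta, data):
    """One Newton step for the assumed model f(x; theta) = theta[0] + theta[1] * x
    with per-example loss 0.5 * (f(x; theta) - y)**2."""
    m = len(data)
    g = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in data:
        feats = [1.0, x]
        resid = theta[0] + theta[1] * x - y
        for a in range(2):
            g[a] += resid * feats[a] / m                 # accumulate gradient
            for b in range(2):
                H[a][b] += feats[a] * feats[b] / m       # accumulate Hessian
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    H_inv = [[H[1][1] / det, -H[0][1] / det],
             [-H[1][0] / det, H[0][0] / det]]
    # delta_theta = -H^{-1} g
    delta = [-(H_inv[a][0] * g[0] + H_inv[a][1] * g[1]) for a in range(2)]
    return [theta[a] + delta[a] for a in range(2)]

# Toy data generated by y = 1 + 2x, so the minimizer is theta = [1, 2].
data = [(x, 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
theta_newton = newton_step([0.0, 0.0], data)
```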
H
$H$, $H = Q \Lambda Q^\top$
$Q$
$\theta$
$\phi$, $\phi = \Lambda^{1/2} Q^\top \theta$
φ
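$\nabla_\theta f(\theta_0)$
$\nabla_\phi f(\theta_0)$
$$\nabla_\phi f(\theta_0) = \left(\frac{\partial\theta}{\partial\phi}\right)^{\!\top} \nabla_\theta f(\theta_0) = \left(Q\Lambda^{-1/2}\right)^\top \nabla_\theta f(\theta_0)$$
$$\nabla_\theta f(\theta_0) = \left(\Lambda^{1/2} Q^\top\right)^\top \nabla_\phi f(\theta_0)$$
$H$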
φ
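$$\begin{aligned}
f(\theta) &\approx f(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta f(\theta_0) + \frac{1}{2}(\theta - \theta_0)^\top H (\theta - \theta_0) \\
&= f(\theta_0) + (\theta - \theta_0)^\top Q\Lambda^{1/2} \nabla_\phi f(\theta_0) + \frac{1}{2}(\theta - \theta_0)^\top Q\Lambda Q^\top (\theta - \theta_0) \\
&= f(\theta_0) + \left[\Lambda^{1/2}Q^\top\theta - \Lambda^{1/2}Q^\top\theta_0\right]^\top \nabla_\phi f(\theta_0) + \frac{1}{2}\left[\Lambda^{1/2}Q^\top\theta - \Lambda^{1/2}Q^\top\theta_0\right]^\top \left[\Lambda^{1/2}Q^\top\theta - \Lambda^{1/2}Q^\top\theta_0\right] \\
&= f(\theta_0) + (\phi - \phi_0)^\top \nabla_\phi f(\theta_0) + \frac{1}{2}(\phi - \phi_0)^\top (\phi - \phi_0).
\end{aligned}$$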
φ
φ
$$\phi^* = \phi_0 - \nabla_\phi f(\theta_0),$$
φ
θ φ
$Q$, $i$: $Q_{:,i}^\top Q_{:,i} = 1$, $Q_{i,:} Q_{i,:}^\top = 1$; for $i \neq j$: $Q_{:,i}^\top Q_{:,j} = 0$, $Q_{i,:} Q_{j,:}^\top = 0$
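[Figure: change of coordinates between $\theta$-space (axes $\theta_1$, $\theta_2$) and $\phi$-space (axes $\phi_1$, $\phi_2$), mapped by $\Lambda^{1/2} Q^\top$ and back by $Q\Lambda^{-1/2}$.]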
θ φ
α
$$\theta^* = \theta_0 - [H(f(\theta_0)) + \alpha I]^{-1} \nabla_\theta f(\theta_0).$$
α
α αI
K
K
$K \times K$, $O(K^3)$
$d_{t-1}$
$d_{t-1}$
$\nabla_\theta J(\theta) \cdot d_{t-1} = 0$
$d_t = \nabla_\theta J(\theta)$, $d_{t-1}$
d
t
d
t1
d
t1
d
t
t d
t
$$d_t = \nabla_\theta J(\theta) + \beta_t d_{t-1}$$
$\beta_t$, $d_{t-1}$
$d_t$, $d_{t-1}$
$$d_t^\top H(J)\, d_{t-1} = 0$$
φ
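$$\begin{aligned}
d_t^\top H d_{t-1} &= 0 \\
d_t^\top Q\Lambda Q^\top d_{t-1} &= 0 \\
\left(\Lambda^{1/2} Q^\top d_t\right)^\top \Lambda^{1/2} Q^\top d_{t-1} &= 0 \\
d_t^{(\phi)\top} d_{t-1}^{(\phi)} &= 0,
\end{aligned}$$
where $d_t^{(\phi)} = \Lambda^{1/2} Q^\top d_t$ and $d_{t-1}^{(\phi)} = \Lambda^{1/2} Q^\top d_{t-1}$.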
φ
φ
H
β
t
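Fletcher-Reeves:
$$\beta_t = \frac{\nabla_\theta J(\theta_t)^\top \nabla_\theta J(\theta_t)}{\nabla_\theta J(\theta_{t-1})^\top \nabla_\theta J(\theta_{t-1})}$$
Polak-Ribière:
$$\beta_t = \frac{\left(\nabla_\theta J(\theta_t) - \nabla_\theta J(\theta_{t-1})\right)^\top \nabla_\theta J(\theta_t)}{\nabla_\theta J(\theta_{t-1})^\top \nabla_\theta J(\theta_{t-1})}$$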
k
k
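Require: initial parameter $\theta_0$
Initialize $\rho_0 = 0$
while stopping criterion not met do
    $g_t = 0$
    for $i = 1$ to $m$ do
        $g_t \leftarrow g_t + \frac{1}{m}\nabla_\theta L(f(x^{(i)};\theta_t), y^{(i)})$
    end for
    $\beta_t = \frac{(g_t - g_{t-1})^\top g_t}{g_{t-1}^\top g_{t-1}}$ (Polak-Ribière)
    $\rho_t = -g_t + \beta_t \rho_{t-1}$
    $\eta^* = \arg\min_\eta \frac{1}{m}\sum_{i=1}^{m} L(f(x^{(i)};\theta_t + \eta\rho_t), y^{(i)})$
    $\theta_{t+1} = \theta_t + \eta^* \rho_t$
end while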
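On a quadratic objective the line search has a closed form, which makes the conjugate direction update easy to check. The sketch below assumes $J(\theta) = \frac{1}{2}\theta^\top A\theta - b^\top\theta$ with a hypothetical symmetric positive definite $A$; the exact step $\eta^* = -g^\top\rho / (\rho^\top A \rho)$ holds only for this quadratic case, and $\beta$ is set to zero on the first iteration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def conjugate_gradient(A, b, theta, iters):
    rho = [0.0] * len(theta)      # rho_0 = 0
    g_prev = None
    for _ in range(iters):
        g = [Ai - bi for Ai, bi in zip(matvec(A, theta), b)]   # gradient A*theta - b
        if dot(g, g) == 0.0:
            break
        # Polak-Ribiere coefficient (0 on the first iteration)
        beta = 0.0 if g_prev is None else (
            dot([a - c for a, c in zip(g, g_prev)], g) / dot(g_prev, g_prev))
        rho = [-gi + beta * ri for gi, ri in zip(g, rho)]      # search direction
        eta = -dot(g, rho) / dot(rho, matvec(A, rho))          # exact line search (quadratic only)
        theta = [ti + eta * ri for ti, ri in zip(theta, rho)]
        g_prev = g
    return theta

A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
theta_cg = conjugate_gradient(A, b, [0.0, 0.0], iters=2)
```

With two parameters, two conjugate directions suffice, so the result matches the solution of $A\theta = b$, which here is $(0.2, 0.4)$.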
$$\theta^* = \theta_0 - [H(J(\theta_0))]^{-1} \nabla_\theta J(\theta_0).$$
$H(J)(\theta_0)$
M
t
H(J)
t
$$\theta_{t+1} - \theta_t = H^{-1}\left(\nabla_\theta J(\theta_{t+1}) - \nabla_\theta J(\theta_t)\right)$$
M H
1
M
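$$M_t = M_{t-1} + \left(1 + \frac{\phi^\top M_{t-1}\phi}{\Delta^\top\phi}\right)\frac{\Delta\Delta^\top}{\Delta^\top\phi} - \frac{\Delta\phi^\top M_{t-1} + M_{t-1}\phi\Delta^\top}{\Delta^\top\phi},$$
where $g_t = \nabla_\theta J(\theta_t)$, $\phi = g_t - g_{t-1}$, and $\Delta = \theta_t - \theta_{t-1}$.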
$\theta \in \mathbb{R}^n$, $O(n^2)$
$M_t$
$\rho_t$
$$\rho_t = -M_t g_t$$
$\eta^*$
$$\theta_{t+1} = \theta_t + \eta^* \rho_t.$$
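Require: initial parameter $\theta_0$
Initialize inverse Hessian approximation $M_0 = I$
while stopping criterion not met do
    Compute gradient: $g_t = \nabla_\theta J(\theta_t)$
    Compute $\phi = g_t - g_{t-1}$, $\Delta = \theta_t - \theta_{t-1}$
    Approximate $H^{-1}$: $M_t = M_{t-1} + \left(1 + \frac{\phi^\top M_{t-1}\phi}{\Delta^\top\phi}\right)\frac{\Delta\Delta^\top}{\Delta^\top\phi} - \frac{\Delta\phi^\top M_{t-1} + M_{t-1}\phi\Delta^\top}{\Delta^\top\phi}$
    Compute search direction: $\rho_t = -M_t g_t$
    Perform line search: $\eta^* = \arg\min_\eta J(\theta_t + \eta\rho_t)$
    Apply update: $\theta_{t+1} = \theta_t + \eta^* \rho_t$
end while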
M O(n
2
)
M M
t1
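$$\rho_t = -g_t + a\Delta + b\phi,$$
where the scalars $a$ and $b$ are given by
$$a = -\left(1 + \frac{\phi^\top\phi}{\Delta^\top\phi}\right)\frac{\Delta^\top g_t}{\Delta^\top\phi} + \frac{\phi^\top g_t}{\Delta^\top\phi}$$
$$b = \frac{\Delta^\top g_t}{\Delta^\top\phi}$$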
J(θ) αJ(θ)
θ
x
y x
N
N
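$$\arg\min_{\Delta\theta}\ \mathbb{E}_{\hat{p}}\left[-\log p_{\theta+\Delta\theta}(x)\right] \quad \text{s.t.} \quad \mathrm{KL}\left(p_\theta(x) \,\|\, p_{\theta+\Delta\theta}(x)\right) = \Delta\mathrm{KL}.$$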
∆KL
θ 0 X
log p
θ+∆θ
θ
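$$\log p_{\theta+\Delta\theta} \approx \log p_\theta + (\nabla \log p_\theta)^\top \Delta\theta + \frac{1}{2}\Delta\theta^\top \nabla^2 \log p_\theta\, \Delta\theta.$$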
$$\sum_X p_\theta \nabla_\theta \log p_\theta = \sum_X \nabla_\theta p_\theta = \nabla_\theta \sum_X p_\theta = \nabla_\theta 1 = 0,$$
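$\mathrm{KL}(p_\theta \,\|\, p_{\theta+\Delta\theta})$
$$\begin{aligned}
\mathrm{KL}(p_\theta \,\|\, p_{\theta+\Delta\theta}) &= \sum_X p_\theta \log p_\theta - \sum_X p_\theta \log p_{\theta+\Delta\theta} \\
&\approx \sum_X p_\theta \log p_\theta - \sum_X p_\theta \left[\log p_\theta + (\nabla \log p_\theta)^\top \Delta\theta + \frac{1}{2}\Delta\theta^\top \nabla^2 \log p_\theta\, \Delta\theta\right] \\
&= \frac{1}{2}\Delta\theta^\top\, \mathbb{E}_{p_\theta}\left[-\nabla^2 \log p_\theta\right] \Delta\theta
\end{aligned}$$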
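$(\nabla \log p_\theta)^\top \Delta\theta$
$\mathbb{E}_{p_\theta}\left[-\nabla^2 \log p_\theta\right]$
$\log p_\theta$
$$\mathbb{E}_{p_\theta}\left[(\nabla \log p_\theta)(\nabla \log p_\theta)^\top\right].$$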
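$$0 = \nabla_\theta^2 \sum_X p_\theta = \sum_X \nabla_\theta\left(p_\theta \nabla_\theta \log p_\theta\right) = \sum_X p_\theta (\nabla_\theta \log p_\theta)(\nabla_\theta \log p_\theta)^\top + \sum_X p_\theta \nabla_\theta^2 \log p_\theta$$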
log p
θ+∆θ
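$$L_N(\theta, \Delta\theta) = \mathbb{E}_{\hat{p}}[-\log p_\theta] + \mathbb{E}_{\hat{p}}[-\nabla \log p_\theta]^\top \Delta\theta + \frac{\lambda}{2}\Delta\theta^\top\, \mathbb{E}_{p_\theta}\left[-\nabla^2 \log p_\theta\right] \Delta\theta.$$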
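$\Delta\theta$
$$\frac{\partial L_N(\theta, \Delta\theta)}{\partial \Delta\theta} = 0$$
$\Delta\theta = \theta_{t+1} - \theta_t$
$$\theta_{t+1} = \theta_t - \frac{1}{\lambda}\left[\mathbb{E}_{p_\theta}\left(-\nabla^2 \log p_\theta\right)\right]^{-1} \mathbb{E}_{\hat{p}}[-\nabla \log p_\theta].$$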
p
θ
ˆp
f(x)
$x_i$, $x_j$
$$f(x) = (x_1 - x_2)^2 + \alpha\left(x_1^2 + x_2^2\right)$$
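Coordinate-wise minimization of this function has a closed form, which the Python sketch below uses. It assumes the objective is $f(x) = (x_1 - x_2)^2 + \alpha(x_1^2 + x_2^2)$; setting $\partial f/\partial x_1 = 0$ with $x_2$ fixed gives $x_1 = x_2/(1+\alpha)$, and symmetrically for $x_2$. The starting point and the value of $\alpha$ are arbitrary illustrative choices.

```python
def f(x1, x2, alpha=0.5):
    # Assumed objective: (x1 - x2)^2 + alpha * (x1^2 + x2^2)
    return (x1 - x2) ** 2 + alpha * (x1 ** 2 + x2 ** 2)

def coordinate_descent(x1, x2, alpha=0.5, sweeps=50):
    for _ in range(sweeps):
        # Minimize over x1 with x2 fixed: 2*(x1 - x2) + 2*alpha*x1 = 0
        x1 = x2 / (1.0 + alpha)
        # Minimize over x2 with x1 fixed: -2*(x1 - x2) + 2*alpha*x2 = 0
        x2 = x1 / (1.0 + alpha)
    return x1, x2

x1_cd, x2_cd = coordinate_descent(3.0, -2.0)
```

Each sweep shrinks both coordinates by a factor of $1/(1+\alpha)^2$, so smaller $\alpha$ means slower progress: the coupling term $(x_1 - x_2)^2$ dominates and a change in one variable strongly shifts the best value of the other.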
α α
θ
θ
0
p(θ) θ
0
θ
0
θ
0
$m$, $n$: $U\left(-\frac{1}{\sqrt{m}}, \frac{1}{\sqrt{m}}\right)$
$$W_{i,j} \sim U\left(-\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}}\right).$$
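A minimal Python sketch of drawing a weight matrix from this uniform range, assuming a layer with $m$ inputs and $n$ outputs. The function name, the list-of-lists representation, and the fixed seed are illustrative choices, not from the text.

```python
import math
import random

def normalized_uniform_init(m, n, seed=0):
    """Draw an m-by-n weight matrix with entries from
    U(-sqrt(6 / (m + n)), sqrt(6 / (m + n)))."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (m + n))
    return [[rng.uniform(-limit, limit) for _ in range(n)] for _ in range(m)]

W_init = normalized_uniform_init(4, 3)
```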
g
g
$1/\sqrt{m}$
k
m
m
i
c
i
c b
(b) = c
x
x
$p(y \mid x) = \mathcal{N}(y \mid w^\top x + b, 1/\beta)$
$\beta$
k
δ
J(θ)
$\{J^{(0)}, \dots, J^{(n)}\}$
$J^{(0)}$, $J^{(n)}$
$J(\theta)$
$J^{(i)}$, $J^{(i+1)}$
θ
J(θ)
$$J^{(i)}(\theta) = \mathbb{E}_{\theta' \sim \mathcal{N}(\theta';\, \theta,\, \sigma^{(i)2})}\, J(\theta')$$
J(θ) = θ
θ
J
(i)